Creating Cohorts of Songs

Course-end Project-3

Name-Ankit Malhotra

Objective: As a data scientist, perform exploratory data analysis and cluster analysis to create cohorts of songs. The goal is to better understand the factors that contribute to forming a cohort of songs.

In [1]:
# Importing the libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

EDA: Exploratory Data Analysis

In [2]:
df = pd.read_excel('1673873388_rolling_stones_spotify.xlsx')
In [3]:
df
Out[3]:
Unnamed: 0 name album release_date track_number id uri acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 0 Concert Intro Music - Live Licked Live In NYC 2022-06-10 1 2IEkywLJ4ykbhi1yRQvmsT spotify:track:2IEkywLJ4ykbhi1yRQvmsT 0.0824 0.463 0.993 0.996000 0.9320 -12.913 0.1100 118.001 0.0302 33 48640
1 1 Street Fighting Man - Live Licked Live In NYC 2022-06-10 2 6GVgVJBKkGJoRfarYRvGTU spotify:track:6GVgVJBKkGJoRfarYRvGTU 0.4370 0.326 0.965 0.233000 0.9610 -4.803 0.0759 131.455 0.3180 34 253173
2 2 Start Me Up - Live Licked Live In NYC 2022-06-10 3 1Lu761pZ0dBTGpzxaQoZNW spotify:track:1Lu761pZ0dBTGpzxaQoZNW 0.4160 0.386 0.969 0.400000 0.9560 -4.936 0.1150 130.066 0.3130 34 263160
3 3 If You Can't Rock Me - Live Licked Live In NYC 2022-06-10 4 1agTQzOTUnGNggyckEqiDH spotify:track:1agTQzOTUnGNggyckEqiDH 0.5670 0.369 0.985 0.000107 0.8950 -5.535 0.1930 132.994 0.1470 32 305880
4 4 Don’t Stop - Live Licked Live In NYC 2022-06-10 5 7piGJR8YndQBQWVXv6KtQw spotify:track:7piGJR8YndQBQWVXv6KtQw 0.4000 0.303 0.969 0.055900 0.9660 -5.098 0.0930 130.533 0.2060 32 305106
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1605 1605 Carol The Rolling Stones 1964-04-16 8 08l7M5UpRnffGl0FyuRiQZ spotify:track:08l7M5UpRnffGl0FyuRiQZ 0.1570 0.466 0.932 0.006170 0.3240 -9.214 0.0429 177.340 0.9670 39 154080
1606 1606 Tell Me The Rolling Stones 1964-04-16 9 3JZllQBsTM6WwoJdzFDLhx spotify:track:3JZllQBsTM6WwoJdzFDLhx 0.0576 0.509 0.706 0.000002 0.5160 -9.427 0.0843 122.015 0.4460 36 245266
1607 1607 Can I Get A Witness The Rolling Stones 1964-04-16 10 0t2qvfSBQ3Y08lzRRoVTdb spotify:track:0t2qvfSBQ3Y08lzRRoVTdb 0.3710 0.790 0.774 0.000000 0.0669 -7.961 0.0720 97.035 0.8350 30 176080
1608 1608 You Can Make It If You Try The Rolling Stones 1964-04-16 11 5ivIs5vwSj0RChOIvlY3On spotify:track:5ivIs5vwSj0RChOIvlY3On 0.2170 0.700 0.546 0.000070 0.1660 -9.567 0.0622 102.634 0.5320 27 121680
1609 1609 Walking The Dog The Rolling Stones 1964-04-16 12 43SkTJJ2xleDaeiE4TIM70 spotify:track:43SkTJJ2xleDaeiE4TIM70 0.3830 0.727 0.934 0.068500 0.0965 -8.373 0.0359 125.275 0.9690 35 189186

1610 rows × 18 columns

In [4]:
df = df.drop(['Unnamed: 0'], axis=1)
In [5]:
df.shape
Out[5]:
(1610, 17)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1610 entries, 0 to 1609
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   name              1610 non-null   object        
 1   album             1610 non-null   object        
 2   release_date      1610 non-null   datetime64[ns]
 3   track_number      1610 non-null   int64         
 4   id                1610 non-null   object        
 5   uri               1610 non-null   object        
 6   acousticness      1610 non-null   float64       
 7   danceability      1610 non-null   float64       
 8   energy            1610 non-null   float64       
 9   instrumentalness  1610 non-null   float64       
 10  liveness          1610 non-null   float64       
 11  loudness          1610 non-null   float64       
 12  speechiness       1610 non-null   float64       
 13  tempo             1610 non-null   float64       
 14  valence           1610 non-null   float64       
 15  popularity        1610 non-null   int64         
 16  duration_ms       1610 non-null   int64         
dtypes: datetime64[ns](1), float64(9), int64(3), object(4)
memory usage: 214.0+ KB
In [7]:
df.describe()
Out[7]:
release_date track_number acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
count 1610 1610.000000 1610.000000 1610.000000 1610.000000 1610.000000 1610.00000 1610.000000 1610.000000 1610.000000 1610.000000 1610.000000 1610.000000
mean 1992-04-23 12:57:14.534161536 8.613665 0.250475 0.468860 0.792352 0.164170 0.49173 -6.971615 0.069512 126.082033 0.582165 20.788199 257736.488199
min 1964-04-16 00:00:00 1.000000 0.000009 0.104000 0.141000 0.000000 0.02190 -24.408000 0.023200 46.525000 0.000000 0.000000 21000.000000
25% 1970-09-04 00:00:00 4.000000 0.058350 0.362250 0.674000 0.000219 0.15300 -8.982500 0.036500 107.390750 0.404250 13.000000 190613.000000
50% 1986-03-24 00:00:00 7.000000 0.183000 0.458000 0.848500 0.013750 0.37950 -6.523000 0.051200 124.404500 0.583000 20.000000 243093.000000
75% 2017-12-01 00:00:00 11.000000 0.403750 0.578000 0.945000 0.179000 0.89375 -4.608750 0.086600 142.355750 0.778000 27.000000 295319.750000
max 2022-06-10 00:00:00 47.000000 0.994000 0.887000 0.999000 0.996000 0.99800 -1.014000 0.624000 216.304000 0.974000 80.000000 981866.000000
std NaN 6.560220 0.227397 0.141775 0.179886 0.276249 0.34910 2.994003 0.051631 29.233483 0.231253 12.426859 108333.474920
In [8]:
df.isnull().sum()
Out[8]:
name                0
album               0
release_date        0
track_number        0
id                  0
uri                 0
acousticness        0
danceability        0
energy              0
instrumentalness    0
liveness            0
loudness            0
speechiness         0
tempo               0
valence             0
popularity          0
duration_ms         0
dtype: int64
In [9]:
df.columns
Out[9]:
Index(['name', 'album', 'release_date', 'track_number', 'id', 'uri',
       'acousticness', 'danceability', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'popularity',
       'duration_ms'],
      dtype='object')
In [10]:
df.dtypes
Out[10]:
name                        object
album                       object
release_date        datetime64[ns]
track_number                 int64
id                          object
uri                         object
acousticness               float64
danceability               float64
energy                     float64
instrumentalness           float64
liveness                   float64
loudness                   float64
speechiness                float64
tempo                      float64
valence                    float64
popularity                   int64
duration_ms                  int64
dtype: object
In [11]:
df
Out[11]:
name album release_date track_number id uri acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 Concert Intro Music - Live Licked Live In NYC 2022-06-10 1 2IEkywLJ4ykbhi1yRQvmsT spotify:track:2IEkywLJ4ykbhi1yRQvmsT 0.0824 0.463 0.993 0.996000 0.9320 -12.913 0.1100 118.001 0.0302 33 48640
1 Street Fighting Man - Live Licked Live In NYC 2022-06-10 2 6GVgVJBKkGJoRfarYRvGTU spotify:track:6GVgVJBKkGJoRfarYRvGTU 0.4370 0.326 0.965 0.233000 0.9610 -4.803 0.0759 131.455 0.3180 34 253173
2 Start Me Up - Live Licked Live In NYC 2022-06-10 3 1Lu761pZ0dBTGpzxaQoZNW spotify:track:1Lu761pZ0dBTGpzxaQoZNW 0.4160 0.386 0.969 0.400000 0.9560 -4.936 0.1150 130.066 0.3130 34 263160
3 If You Can't Rock Me - Live Licked Live In NYC 2022-06-10 4 1agTQzOTUnGNggyckEqiDH spotify:track:1agTQzOTUnGNggyckEqiDH 0.5670 0.369 0.985 0.000107 0.8950 -5.535 0.1930 132.994 0.1470 32 305880
4 Don’t Stop - Live Licked Live In NYC 2022-06-10 5 7piGJR8YndQBQWVXv6KtQw spotify:track:7piGJR8YndQBQWVXv6KtQw 0.4000 0.303 0.969 0.055900 0.9660 -5.098 0.0930 130.533 0.2060 32 305106
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1605 Carol The Rolling Stones 1964-04-16 8 08l7M5UpRnffGl0FyuRiQZ spotify:track:08l7M5UpRnffGl0FyuRiQZ 0.1570 0.466 0.932 0.006170 0.3240 -9.214 0.0429 177.340 0.9670 39 154080
1606 Tell Me The Rolling Stones 1964-04-16 9 3JZllQBsTM6WwoJdzFDLhx spotify:track:3JZllQBsTM6WwoJdzFDLhx 0.0576 0.509 0.706 0.000002 0.5160 -9.427 0.0843 122.015 0.4460 36 245266
1607 Can I Get A Witness The Rolling Stones 1964-04-16 10 0t2qvfSBQ3Y08lzRRoVTdb spotify:track:0t2qvfSBQ3Y08lzRRoVTdb 0.3710 0.790 0.774 0.000000 0.0669 -7.961 0.0720 97.035 0.8350 30 176080
1608 You Can Make It If You Try The Rolling Stones 1964-04-16 11 5ivIs5vwSj0RChOIvlY3On spotify:track:5ivIs5vwSj0RChOIvlY3On 0.2170 0.700 0.546 0.000070 0.1660 -9.567 0.0622 102.634 0.5320 27 121680
1609 Walking The Dog The Rolling Stones 1964-04-16 12 43SkTJJ2xleDaeiE4TIM70 spotify:track:43SkTJJ2xleDaeiE4TIM70 0.3830 0.727 0.934 0.068500 0.0965 -8.373 0.0359 125.275 0.9690 35 189186

1610 rows × 17 columns

In [12]:
# Columns to include in the correlation matrix
# (release_date is datetime64; all other columns are numeric)
numeric_columns = ['release_date', 'track_number', 'acousticness', 'danceability', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'popularity',
       'duration_ms']
# Calculate the pairwise (Pearson) correlation between the selected columns
correlation_matrix = df[numeric_columns].corr()
In [13]:
sns.heatmap(correlation_matrix)
Out[13]:
<Axes: >
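The heatmap encodes correlations as colors only, so a ranked list of the strongest pairs can be easier to read. The following is a minimal sketch on synthetic data, not on `df`; the helper name `top_correlations` and the demo columns `a`, `b`, `c` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    cols = corr.columns
    # Upper-triangle indices (k=1 skips the diagonal) so each pair appears once.
    rows, cols_idx = np.triu_indices(len(cols), k=1)
    labels = [f"{cols[r]} ~ {cols[c]}" for r, c in zip(rows, cols_idx)]
    pairs = pd.Series(corr.to_numpy()[rows, cols_idx], index=labels)
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Synthetic stand-in for the song features (hypothetical columns).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                      # independent noise
})
ranked = top_correlations(demo)
print(ranked)
```

On the real data, the same helper could be called on the numeric feature columns to list which audio features move together.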
In [14]:
df['track_number'].value_counts().plot(kind='barh', figsize=(10,15))
plt.xlabel('Count')
plt.ylabel('Track Number')
plt.title('Number of Songs per Track Number')
plt.show()
In [15]:
sns.scatterplot(x=df['acousticness'],y=df['energy'])
plt.title('Acousticness vs Energy')
plt.show()
In [16]:
sns.scatterplot(x=df['track_number'],y=df['energy'])
plt.title('Track Number vs Energy')
plt.show()
In [17]:
sns.scatterplot(x=df['energy'],y=df['danceability'])
plt.title('Energy vs Danceability')
plt.show()
In [18]:
plt.figure(figsize=(8,8),dpi=100)
plt.scatter(x=df['liveness'],y=df['loudness'])
plt.xlabel('Liveness')
plt.ylabel('Loudness')
plt.title('Liveness vs Loudness')
plt.show()
In [19]:
sns.scatterplot(x=df['liveness'],y=df['popularity'])
plt.title('Liveness vs Popularity')
plt.show()
In [20]:
sns.scatterplot(x=df['energy'],y=df['liveness'],hue=df['popularity'])
plt.title('Energy vs Liveness with respect to Popularity')
plt.show()
In [21]:
sns.histplot(df['duration_ms'])
plt.title('Distribution of Song Duration (ms)')
plt.show()
In [22]:
sns.scatterplot(x=df['tempo'],y=df['valence'],hue=df['loudness'])
plt.title('Tempo vs Valence with respect to Loudness')
plt.show()
In [23]:
sns.jointplot(x=df['track_number'],y=df['popularity'])
plt.show()
In [24]:
sns.scatterplot(x=df['danceability'],y=df['energy'],hue=df['liveness'])
plt.title('Danceability vs Energy with respect to Liveness')
plt.show()
In [25]:
sns.jointplot(x=df['danceability'],y=df['energy'],kind='hex')
plt.show()
In [26]:
sns.jointplot(x=df['instrumentalness'],y=df['tempo'])
plt.show()
In [27]:
plt.figure(figsize=(10,20))
sns.countplot(y=df['album'])
plt.title('Album')
plt.show()
In [28]:
plt.figure(figsize=(10,20))
sns.scatterplot(y=df['album'],x=df['popularity'])
plt.title('Album vs Popularity')
plt.show()
In [29]:
plt.figure(figsize=(10,20))
sns.scatterplot(y=df['album'],x=df['popularity'],hue=df['track_number'])
plt.title('Album vs Popularity with respect to Track Number')
plt.show()
In [30]:
plt.figure(figsize=(8,12))
sns.countplot(y=df['release_date'])
plt.title('Number of Songs Released per Date')
plt.show()
In [31]:
sns.boxplot(df['popularity'])
plt.show()
In [32]:
plt.figure(figsize=(15,5))
sns.boxplot(x=df['popularity'],y=df['track_number'])
plt.show()
In [33]:
plt.figure(figsize=(15,5))
sns.barplot(x=df['track_number'],y=df['energy'])
plt.show()
In [34]:
sns.boxplot(df['duration_ms'])
Out[34]:
<Axes: >
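A box plot draws its whiskers at 1.5 times the interquartile range by default, and points beyond them are shown as outliers. As a worked check, the fences for `duration_ms` can be computed from the quartiles reported by `df.describe()` above; the variable names are chosen here for illustration.

```python
# Quartiles, min and max of duration_ms, copied from the df.describe() output.
q1, q3 = 190613.0, 295319.75
min_duration, max_duration = 21000.0, 981866.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr    # values above this are drawn as outliers

print(iqr, lower_fence, upper_fence)
print(min_duration < lower_fence, max_duration > upper_fence)
```

Both the shortest and the longest track fall outside the fences, so the box plot shows outliers on both sides.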
In [35]:
cols = ['track_number','energy','popularity','liveness']
sns.pairplot(df,vars=cols)
plt.show()
In [36]:
cols = ['track_number','acousticness','danceability','liveness']
sns.pairplot(df,vars=cols,hue='popularity')
plt.show()
In [37]:
plt.figure(figsize=(8,15))
plt.subplots_adjust(hspace=0.5,wspace=0.5)

plt.subplot(5,2,1)
plt.hist(df['energy'])
plt.title('Energy')

plt.subplot(5,2,2)
plt.hist(df['track_number'])
plt.title('Track Number')

plt.subplot(5,2,3)
plt.hist(df['popularity'])
plt.title('Popularity')

plt.subplot(5,2,4)
plt.hist(df['liveness'])
plt.title('Liveness')

plt.subplot(5,2,5)
plt.hist(df['tempo'])
plt.title('Tempo')

plt.subplot(5,2,6)
plt.hist(df['duration_ms'])
plt.title('Duration MS')

plt.subplot(5,2,7)
plt.hist(df['instrumentalness'])
plt.title('Instrumentalness')

plt.subplot(5,2,8)
plt.hist(df['danceability'])
plt.title('Danceability')

plt.subplot(5,2,9)
plt.hist(df['loudness'])
plt.title('Loudness')

plt.subplot(5,2,10)
plt.hist(df['speechiness'])
plt.title('Speechiness')
plt.show()
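Hand-numbered `plt.subplot` calls make it easy to reuse or skip a grid position. One alternative is to build the grid with `plt.subplots` and loop over the columns. This is a minimal sketch on synthetic data; the helper name `plot_histograms`, the demo frame, and the headless `Agg` backend are assumptions made so the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a few of the song features.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((100, 5)),
    columns=["energy", "liveness", "tempo", "valence", "loudness"],
)

def plot_histograms(frame, columns, ncols=2):
    """Draw one histogram per column on a grid sized to fit all of them."""
    nrows = -(-len(columns) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(8, 3 * nrows), squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, columns):
        ax.hist(frame[col])
        ax.set_title(col)
    # Hide any grid cells left over when the column count is not a multiple of ncols.
    for ax in flat[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig

fig = plot_histograms(demo, list(demo.columns))
```

Because each axis is taken from the grid in order, no panel can be overwritten by a repeated index.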

Cluster Analysis

In [40]:
X = df.drop(['name','release_date','id','uri'],axis=1)
In [41]:
X
Out[41]:
album track_number acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 Licked Live In NYC 1 0.0824 0.463 0.993 0.996000 0.9320 -12.913 0.1100 118.001 0.0302 33 48640
1 Licked Live In NYC 2 0.4370 0.326 0.965 0.233000 0.9610 -4.803 0.0759 131.455 0.3180 34 253173
2 Licked Live In NYC 3 0.4160 0.386 0.969 0.400000 0.9560 -4.936 0.1150 130.066 0.3130 34 263160
3 Licked Live In NYC 4 0.5670 0.369 0.985 0.000107 0.8950 -5.535 0.1930 132.994 0.1470 32 305880
4 Licked Live In NYC 5 0.4000 0.303 0.969 0.055900 0.9660 -5.098 0.0930 130.533 0.2060 32 305106
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1605 The Rolling Stones 8 0.1570 0.466 0.932 0.006170 0.3240 -9.214 0.0429 177.340 0.9670 39 154080
1606 The Rolling Stones 9 0.0576 0.509 0.706 0.000002 0.5160 -9.427 0.0843 122.015 0.4460 36 245266
1607 The Rolling Stones 10 0.3710 0.790 0.774 0.000000 0.0669 -7.961 0.0720 97.035 0.8350 30 176080
1608 The Rolling Stones 11 0.2170 0.700 0.546 0.000070 0.1660 -9.567 0.0622 102.634 0.5320 27 121680
1609 The Rolling Stones 12 0.3830 0.727 0.934 0.068500 0.0965 -8.373 0.0359 125.275 0.9690 35 189186

1610 rows × 13 columns

In [42]:
y = df['popularity']
In [43]:
y
Out[43]:
0       33
1       34
2       34
3       32
4       32
        ..
1605    39
1606    36
1607    30
1608    27
1609    35
Name: popularity, Length: 1610, dtype: int64
In [44]:
from sklearn.preprocessing import LabelEncoder
In [45]:
le = LabelEncoder()
In [46]:
X['album'] = le.fit_transform(X['album'])
In [47]:
X.head()
Out[47]:
album track_number acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 47 1 0.0824 0.463 0.993 0.996000 0.932 -12.913 0.1100 118.001 0.0302 33 48640
1 47 2 0.4370 0.326 0.965 0.233000 0.961 -4.803 0.0759 131.455 0.3180 34 253173
2 47 3 0.4160 0.386 0.969 0.400000 0.956 -4.936 0.1150 130.066 0.3130 34 263160
3 47 4 0.5670 0.369 0.985 0.000107 0.895 -5.535 0.1930 132.994 0.1470 32 305880
4 47 5 0.4000 0.303 0.969 0.055900 0.966 -5.098 0.0930 130.533 0.2060 32 305106
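`LabelEncoder` assigns integer codes in sorted order of the class names, so the numbers carry no musical meaning even though a distance-based method such as k-means will treat them as ordered. A small sketch using the two album names visible in the output above (the short list is illustrative, not the full dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Two album names taken from the table above, repeated to mimic several tracks.
albums = ["Licked Live In NYC", "The Rolling Stones", "Licked Live In NYC"]

encoder = LabelEncoder()
codes = encoder.fit_transform(albums)

print(list(encoder.classes_))  # classes are stored in sorted order
print(list(codes))             # each album is replaced by its position in that order
```

Whether the encoded album column should take part in clustering at all is a judgment call; leaving it out and comparing the resulting cohorts is one way to check its influence.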
In [48]:
from sklearn.preprocessing import MinMaxScaler
In [49]:
ms = MinMaxScaler()
In [50]:
cols = X.columns
In [51]:
X = ms.fit_transform(X)
In [52]:
X
Out[52]:
array([[0.52808989, 0.        , 0.08288914, ..., 0.03100616, 0.4125    ,
        0.02876572],
       [0.52808989, 0.02173913, 0.43963279, ..., 0.32648871, 0.425     ,
        0.24162891],
       [0.52808989, 0.04347826, 0.41850584, ..., 0.32135524, 0.425     ,
        0.25202265],
       ...,
       [0.85393258, 0.19565217, 0.3732338 , ..., 0.85728953, 0.375     ,
        0.16139607],
       [0.85393258, 0.2173913 , 0.21830283, ..., 0.54620123, 0.3375    ,
        0.10478048],
       [0.85393258, 0.23913043, 0.38530634, ..., 0.99486653, 0.4375    ,
        0.17503585]])
In [53]:
X = pd.DataFrame(X,columns=cols)
In [54]:
X
Out[54]:
album track_number acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 0.528090 0.000000 0.082889 0.458493 0.993007 1.000000 0.932384 0.491365 0.144474 0.420994 0.031006 0.4125 0.028766
1 0.528090 0.021739 0.439633 0.283525 0.960373 0.233936 0.962094 0.838035 0.087716 0.500239 0.326489 0.4250 0.241629
2 0.528090 0.043478 0.418506 0.360153 0.965035 0.401606 0.956972 0.832350 0.152796 0.492057 0.321355 0.4250 0.252023
3 0.528090 0.065217 0.570419 0.338442 0.983683 0.000107 0.894478 0.806745 0.282623 0.509303 0.150924 0.4000 0.296483
4 0.528090 0.086957 0.402409 0.254151 0.965035 0.056124 0.967216 0.825425 0.116178 0.494808 0.211499 0.4000 0.295677
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1605 0.853933 0.152174 0.157940 0.462324 0.921911 0.006195 0.309497 0.649483 0.032790 0.770502 0.992813 0.4875 0.138500
1606 0.853933 0.173913 0.057939 0.517241 0.658508 0.000002 0.506198 0.640378 0.101698 0.444637 0.457906 0.4500 0.233400
1607 0.853933 0.195652 0.373234 0.876117 0.737762 0.000000 0.046102 0.703044 0.081225 0.297504 0.857290 0.3750 0.161396
1608 0.853933 0.217391 0.218303 0.761175 0.472028 0.000070 0.147628 0.634393 0.064913 0.330483 0.546201 0.3375 0.104780
1609 0.853933 0.239130 0.385306 0.795658 0.924242 0.068775 0.076427 0.685432 0.021138 0.463838 0.994867 0.4375 0.175036

1610 rows × 13 columns
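Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). As a worked check, two values from the scaled table can be reproduced by hand using the column minima and maxima from `df.describe()`; the helper name `min_max` is introduced here for illustration.

```python
def min_max(x, lo, hi):
    """Scale x to [0, 1] given the column minimum lo and maximum hi."""
    return (x - lo) / (hi - lo)

# popularity ranges from 0 to 80; the first row has popularity 33.
scaled_popularity = min_max(33, 0, 80)

# track_number ranges from 1 to 47; the second row has track_number 2.
scaled_track = min_max(2, 1, 47)

print(round(scaled_popularity, 4))
print(round(scaled_track, 6))
```

These agree with the `popularity` value in row 0 (0.4125) and the `track_number` value in row 1 (0.021739) of the scaled frame.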

In [55]:
from sklearn.cluster import KMeans
In [56]:
cs = []
for i in range(1,10):
    kmeans = KMeans(n_clusters=i,init='k-means++',max_iter=300,n_init=10,random_state=0)
    kmeans.fit(X)
    cs.append(kmeans.inertia_)
In [57]:
plt.plot(range(1,10),cs)
plt.title('Elbow method')
plt.xlabel('Number of Clusters, k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
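The elbow in an inertia curve can be hard to pin down by eye, so the silhouette score is often used as a second opinion on the number of clusters. The sketch below runs on synthetic blobs rather than on `X`; the blob centers and the range of `k` are assumptions chosen so the example is self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic clusters (illustrative data only).
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# The silhouette score needs at least two clusters, so k starts at 2.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    scores[k] = silhouette_score(points, model.fit_predict(points))

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Applied to the scaled song features, the same loop would give one score per candidate `k` to compare against the elbow plot.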
In [58]:
kmeans = KMeans(n_clusters=2,n_init=10,random_state=0)  # n_init set explicitly to avoid the FutureWarning about its default
In [59]:
kmeans.fit(X)
Out[59]:
KMeans(n_clusters=2, n_init=10, random_state=0)
In [60]:
labels = kmeans.labels_
In [61]:
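# popularity is a 0-80 score, not a cluster id, so this only counts rows
# where popularity happens to equal the assigned cluster label (0 or 1)
correct_labels = sum(y==labels)
In [62]:
print('Results: {} out of {} samples were correctly labeled'.format(correct_labels,y.size))
Results: 33 out of 1610 samples were correctly labeled
In [63]:
print('Accuracy Score: {0:0.2f}'.format(correct_labels/float(y.size)))
Accuracy Score: 0.02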
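K-means is unsupervised and its cluster ids are arbitrary, so comparing them with `popularity` does not measure accuracy in the usual sense. A common way to interpret cohorts instead is to profile each cluster by its size and mean feature values. This is a minimal sketch on synthetic data; the two demo features and their distributions are assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic songs: 50 high-energy/low-acoustic and 50 low-energy/high-acoustic tracks.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "energy": np.concatenate([rng.normal(0.9, 0.03, 50), rng.normal(0.4, 0.03, 50)]),
    "acousticness": np.concatenate([rng.normal(0.1, 0.03, 50), rng.normal(0.7, 0.03, 50)]),
})

model = KMeans(n_clusters=2, n_init=10, random_state=0)
demo["cluster"] = model.fit_predict(demo[["energy", "acousticness"]])

# Describe each cohort by its size and average feature values.
sizes = demo["cluster"].value_counts().sort_index()
profile = demo.groupby("cluster").mean()

print(sizes)
print(profile)
```

On the song data, the same `groupby` on the fitted labels would show which features distinguish one cohort from the other.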